<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>Unicode 字符折叠 | Elasticsearch: 权威指南 | Elastic</title>
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
</head>
<body>
<div class="main-container">
    <section id="content">
        
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原文地址: <a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/character-folding.html" rel="nofollow">https://www.elastic.co/guide/cn/elasticsearch/guide/current/character-folding.html</a>, 版权归 www.elastic.co 所有<br/>
                            英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/character-folding.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/guide/current/character-folding.html</a>
                            </div>
                        <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="languages.html">处理人类语言</a></span>
»
<span class="breadcrumb-link"><a href="token-normalization.html">归一化词元</a></span>
»
<span class="breadcrumb-node">Unicode 字符折叠</span>
</div>
<div class="navheader">
<span class="prev">
<a href="case-folding.html">« Unicode 大小写折叠</a>
</span>
<span class="next">
<a href="sorting-collations.html">排序和整理 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="character-folding"></a>Unicode 字符折叠<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/220_Token_normalization/50_Character_folding.asciidoc">edit</a>
</h2>
</div></div></div>
<pre class="literallayout">在多语言((("Unicode", "character folding")))((("tokens", "normalizing", "Unicode character folding")))处理中，`lowercase` 语汇单元过滤器(token filters)是一个很好的开始。但是作为对比的话，也只是对于整个巴别塔的惊鸿一瞥。所以 &lt;&lt;asciifolding-token-filter,`asciifolding` token filter&gt;&gt; 需要更有效的Unicode _字符折叠_ (_character-folding_)工具来处理全世界的各种语言。((("asciifolding token filter")))</pre>

<pre class="literallayout">`icu_folding` 语汇单元过滤器(token filters) (provided by the &lt;&lt;icu-plugin,`icu` plug-in&gt;&gt;)的功能和 `asciifolding` 过滤器一样， ((("icu_folding token filter")))但是它扩展到了非ASCII编码的语言，例如：希腊语，希伯来语，汉语。它把这些语言都转换对应拉丁文字，甚至包含它们的各种各样的计数符号，象形符号和标点符号。</pre>

<pre class="literallayout">`icu_folding` 语汇单元过滤器(token filters)自动使用 `nfkc_cf` 模式来进行大小写折叠和Unicode归一化(normalization)，所以不需要使用 `icu_normalizer` ：</pre>

<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥ <a id="CO146-1"></a><i class="conum" data-value="1"></i></pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO146-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>阿拉伯数字 <code class="literal">١٢٣٤٥</code> 被折叠成等价的拉丁数字: <code class="literal">12345</code>.</p>
</td>
</tr>
</table>
</div>
<p>如果你有指定的字符不想被折叠，你可以使用 <a href="http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html" class="ulink" target="_top"><em>UnicodeSet</em></a>(像字符的正则表达式) 来指定哪些Unicode才可以被折叠。例如：瑞典单词 <code class="literal">å</code>,<code class="literal">ä</code>, <code class="literal">ö</code>, <code class="literal">Å</code>, <code class="literal">Ä</code>, 和 <code class="literal">Ö</code> 不能被折叠，你就可以设定为： <code class="literal">[^åäöÅÄÖ]</code> (<code class="literal">^</code> 表示 <em>不包含</em>)。这样就会对于所有的Unicode字符生效。</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": { <a id="CO147-1"></a><i class="conum" data-value="1"></i>
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": { <a id="CO147-2"></a><i class="conum" data-value="2"></i>
          "tokenizer": "icu_tokenizer",
          "filter":  [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO147-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>`swedish_folding`语汇单元过滤器(token filters) 定制了 `icu_folding`语汇单元过滤器(token filters)来不处理那些大写和小写的瑞典单词。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO147-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">swedish</code> 分析器首先分词，然后用`swedish_folding`语汇单元过滤器来折叠单词，最后把他们走转换为小写，除了被排除在外的单词： <code class="literal">Å</code>, <code class="literal">Ä</code>, 或者 <code class="literal">Ö</code>。</p>
</td>
</tr>
</table>
</div>
</div>
<div class="navfooter">
<span class="prev">
<a href="case-folding.html">« Unicode 大小写折叠</a>
</span>
<span class="next">
<a href="sorting-collations.html">排序和整理 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>